memoize stringly keys of objects #373

Conversation
When unpacking msgpack objects, store the keys that appear in a memo dictionary to make them unique. This is useful because, for most sizable msgpack files, the same keys appear again and again, since many objects have the same "shape" (set of keys). A similar optimization is done in most json deserializers, e.g. in CPython: https://github.com/python/cpython/blob/d89cea15ad37e873003fc74ec2c77660ab620b00/Modules/_json.c#L717

My totally unscientific results: I tried this on two big msgpack files, a wikidata dump (92 MiB) and a dump of reddit comments (596 MiB). I am reporting time spent deserializing and memory use of the resulting data structure. I've included json deserialization numbers as a comparison. The results I get on my old-ish laptop are:

wikidata

| | time | memory |
| --- | --- | --- |
| CPython 3.7.5 before | 3.42s | 1279 MiB |
| CPython 3.7.5 after | 3.43s | 883 MiB |
| PyPy3 7.2 before | 6.44s | 1380 MiB |
| PyPy3 7.2 after | 4.98s | 965 MiB |
| CPython 3.7.5 json | 4.13s | 887 MiB |
| PyPy3 7.2 json | 3.54s | 958 MiB |

reddit

| | time | memory |
| --- | --- | --- |
| CPython 3.7.5 before | 5.62s | 3412 MiB |
| CPython 3.7.5 after | 5.20s | 1754 MiB |
| PyPy3 7.2 before | 14.72s | 3782 MiB |
| PyPy3 7.2 after | 8.37s | 2086 MiB |
| CPython 3.7.5 json | 8.64s | 1753 MiB |
| PyPy3 7.2 json | 10.52s | 2052 MiB |

For wikidata, there is only a memory improvement on CPython; the time stays the same. For the other three variants (both interpreters on reddit, PyPy on wikidata), both time and memory improve significantly. The memory improvement comes from the memoizing itself; the time improves due to better cache locality from the smaller working set, and, in the case of PyPy, less time spent in GC.
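To make the technique concrete, here is a minimal sketch of the idea in pure Python; the function and parameter names are hypothetical and do not match the actual msgpack-python internals:

```python
# Sketch of key memoization while unpacking a map (illustrative names,
# not the real msgpack-python code).
def unpack_map(read_key, read_value, length, memo):
    out = {}
    for _ in range(length):
        key = read_key()
        if isinstance(key, (str, bytes)):
            # Reuse the first object seen for each distinct key, so all
            # maps with the same "shape" share one set of key objects.
            key = memo.setdefault(key, key)
        out[key] = read_value()
    return out
```

Since `memo.setdefault(key, key)` returns the previously stored object for an equal key, every later duplicate can be discarded right after being decoded, which is where the memory savings come from.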
```c
    PyErr_Format(PyExc_ValueError, "%.100s is not allowed for map key", Py_TYPE(k)->tp_name);
    return -1;
}
if (PyUnicode_CheckExact(k) || PyBytes_CheckExact(k)) {
```
Should not make a big difference, but this can be combined with the previous condition in l.194 :)
Thanks for the review! I fixed this.
```cython
ctx.user.encoding = encoding
ctx.user.unicode_errors = unicode_errors
Py_INCREF(d)
ctx.user.memo = <PyObject*>d
```
AFAIK `<object>d` will enable refcounting in Cython. Otherwise `object` and `PyObject*` are identical.
However, I didn't manage to follow this suggestion, because the `memo` field is in a C struct, which doesn't support `object` fields.

I prefer interning because it speeds up string comparison too.
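To illustrate the comparison speedup: interning maps equal strings to a single shared object, so equality checks (for example during dict key lookups) can succeed on the identity fast path without comparing characters:

```python
import sys

a = sys.intern("content_id")
b = sys.intern("".join(["content", "_", "id"]))  # equal, but built separately
assert a is b  # one shared object, so `a == b` short-circuits on identity
```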

Isn't interning a bit dangerous, i.e., those strings are there forever until the interpreter exits?

`PyUnicode_InternImmortal()` creates an immortal string. It is a bit dangerous, as you said.

BTW, no need to care about Python 2 and bytes objects.

Interesting:

So it's been like that for a very long time and that story has just been retold again and again since.

OK, happy to switch to interning in the C version. I would still stick with an explicit dict in the fallback version, if that's OK?

Ideally the fallback would behave identically to the native code. Since you're a PyPy developer, I suppose it doesn't play nicely there? What's wrong about

OK, I investigated; it seems PyPy is also good about intern not leaking anything, I wasn't sure :-).
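For reference, here is a sketch contrasting the two strategies discussed above; the helper names are hypothetical, and this is not the actual patch:

```python
import sys

def memoize_key_intern(key):
    # Interning approach: equal str keys collapse to one shared object in
    # the interpreter-wide intern table. As established above, both CPython
    # and PyPy drop interned strings once they are otherwise unreferenced
    # (unlike PyUnicode_InternImmortal()).
    return sys.intern(key) if isinstance(key, str) else key

def memoize_key_dict(key, memo):
    # Explicit-dict approach (the fallback version): the memo is scoped to
    # a single unpacking run and is freed together with it.
    if isinstance(key, (str, bytes)):
        return memo.setdefault(key, key)
    return key
```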